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Low-Latency Software Polar Decoders 

Pascal Giard, Gabi Sarkis, Camille Leroux, Claude Thibeault, and Warren J. Gross 


Abstract —Polar codes are a new class of capacity-achieving 
error-correcting codes with low encoding and decoding com¬ 
plexity. Their low-complexity decoding algorithms rendering 
them attractive for use in software-defined radio applications 
where computational resources are limited. In this work, we 
present low-latency software polar decoders that exploit modern 
processor capabilities. We show how adapting the algorithm at 
various levels can lead to significant improvements in latency 
and throughput, yielding polar decoders that are suitable for 
high-performance software-defined radio applications on modern 
desktop processors and embedded-platform processors. These 
proposed decoders have an order of magnitude lower latency and 
memory footprint compared to state-of-the-art decoders, while 
maintaining comparable throughput. In addition, we present 
strategies and results for implementing polar decoders on graphi¬ 
cal processing units. Finally, we show that the energy efficiency of 
the proposed decoders is comparable to state-of-the-art software 
polar decoders. 

Index Terms —Polar codes, successive-cancellation decoding, 
software decoders 


1. Introduction 

I N software-defined radio (SDR) applications, researchers 
and engineers have yet to fully harness the error-correction 
capability of modern codes due to their high computational 
complexity. Many are still using classical codes [1], [2] 
as implementing low-latency high-throughput—exceeding 10 
Mbps of information throughput—software decoders for turbo 
or low-density parity-check (LDPC) codes is very challenging. 
The irregular data access patterns featured in decoders of 
modern error-correction codes make efficient use of single¬ 
instruction multiple-data (SIMD) extensions present on today’s 
central processing units (CPUs) difficult. To overcome this 
difficulty and still achieve a good throughput, software de¬ 
coders resorting to inter-frame parallelism (decoding multiple 
independent frames at the same time) are often proposed [3]- 
[5]. Inter-frame parallelism comes at the cost of higher latency, 
as many frames have to be buffered before decoding can be 
started. Even with a split layer approach to LDPC decoding 
where intra-frame parallelism can be applied, the latency 
remains high at multiple milliseconds on a recent desktop 
processor [6]. This work presents software polar decoders 
that enable SDR systems to utilize powerful and fast error- 
correction. 

Polar codes provably achieve the symmetric capacity of 
memoryless channels [7]. Moreover they are well suited 
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for software implementation, due to regular memory access 
patterns, on both x86 and embedded processors [8]-[10]. To 
achieve higher throughput and lower latency on processors, 
software polar decoders can also exploit SIMD vector exten¬ 
sions present on today’s CPUs. Vectorization can be performed 
intra-frame [8] or inter-frame [9], [10], with the former having 
lower decoding latency as it does not require multiple frames 
to start decoding. 

In this work, we explore intra-frame vectorized polar de¬ 
coders. We propose architectures and optimization strategies 
that lead to the implementation of high-performance software 
polar decoders tailored to different processor architectures with 
decoding latency of 26 /Ts for a (32768, 29492) polar code, a 
significant performance improvement compared to that of our 
previous work [8]. We start Section II with a review of the 
construction and decoding of polar codes. We then present two 
different software decoder architectures with varying degrees 
of specialization in Section III. Implementation and results on 
an embedded processor are discussed in Section IV. We also 
adapt the decoder to suit graphical processing units (GPUs), 
an interesting target for applications where many hundreds 
of frames have to be decoded simultaneously, and present 
the results in Section V. Finally, Section VI compares the 
energy consumption of the different decoders and Section VIII 
concludes the paper. 

This paper builds upon the work published in [8] and 
[11]. It provides additional details on the approach as well 
as more experimental results for modern desktop processors. 
Both floating- and fixed-point implementations for the final 
desktop CPU version—the unrolled decoder—were further 
optimized leading to an information throughput of up to 1.4 
Gbps. It also adds results for the adaptation of our strategies to 
an embedded processor leading to a throughput and latency of 
up to 2.25 and 36 times better, respectively, compared to that 
of the state-of-the-art software implementation. Compared to 
the state of the art, both the desktop and embedded processor 
implementations are shown to have one to two orders of mag¬ 
nitude smaller memory footprint. Lastly, strategies and results 
for implementing polar decoders on a graphical processing 
unit (GPU) are presented for the first time. 

II. Polar Codes 

A. Construction of Polar Codes 

Polar codes exploit the channel polarization phenomenon to 
achieve the symmetric capacity of a memoryless channel as 
the code length increases {N —> oo). A polarizing construction 
where V = 2 is shown in Fig. la. The probability of correctly 
estimating bit increases compared to when the bits are 
transmitted without any transformation over the channel W. 
Meanwhile, the probability of correctly estimating bit uq 
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(a) N = 2 (b) A? = 4 


Fig. 1: Construction of polar codes of lengths 2 and 4 


decreases. The polarizing transformation can be combined 
recursively to create longer codes, as shown in Fig. lb for 
= 4. As the A —> oo, the probability of successfully 
estimating each bit approaches either 1 (perfectly reliable) or 
0.5 (completely unreliable), and the proportion of reliable bits 
approaches the symmetric capacity of W [7]. 

To construct an (A, k) polar code, the N — k least reliable 
bits, called the frozen bits, are set to zero and the remaining 
k bits are used to carry information. The frozen bits of an (8, 
5) polar code are indicated in gray in Fig. 2a. The locations 
of the information and frozen bits are based on the type and 
conditions of W. In this work we use polar codes constructed 
according to [12]. The generator matrix, G^, for a polar code 
of length A can be specihed recursively so that Gn - = 

where F 2 = [} j j and ® is the Kronecker power. For 
example, for A = 4, Gm is 
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In [7], bit-reversed indexing is used, which changes the 
generator matrix by multiplying it with a bit-reversal operator 
B, so that G = BF. In this work, natural indexing is used as 
it yields more efficient software decoders [8]. 

B. Tree Representation of Polar Codes 

A polar code of length A is the concatenation of two 
constituent polar codes of length [7]. Therefore, binary 
trees are a natural representation of polar codes [13]. Fig. 2 
illustrates the tree representation of an (8, 5) polar code. 
In Fig. 2a, the frozen bits are labeled in gray while the 
information bits are in black. The corresponding tree, shown 
in Fig. 2b, uses white and black leaf nodes to denote these 
bits, respectively. The gray nodes of Fig. 2b correspond to 
concatenation operations shown in Fig. 2a. 


C. Successive-Cancellation Decoding 

In successive-cancellation (SC) decoding, the decoder tree is 
traversed depth first, selecting left edges before backtracking to 
right ones, until the size-1 frozen and information leaf nodes. 
The messages passed to child nodes are log-likelihood ratios 
(LLRs); while those passed to parents are bit estimates. These 
messages are denoted a and j3, respectively. Messages to a left 




Fig. 2: The graph and tree representation of an (8,5) polar 
code. 


child / are calculated by the / operation using the min-sum 
algorithm; 

«;[!] =/(«,,[/], a,,[i-HWv/2]) 

= sgn(a,,[/])sgn(a,,[! - 1 - Nfi]) min(|a,,[/]|, |a,,[/ -h 

( 1 ) 

where A,, is the size of the corresponding constituent code and 
a,, the LLR input to the node. 

Messages to a right child are calculated using the g opera¬ 
tion 

«.[!] = g(«,.[/], a,[i + ^2], j3;[/]) 

!«,,[/ + F'/ 2 \ + «,.[!'], when Pili] = 0; (2) 

1«,,[! -H F,/ 2 \ - «,.[!'], otherwise, 

where j3; is the bit estimate from the left child. 

Bit estimates at the leaf nodes are set to zero for frozen 
bits and are calculated by performing threshold detection for 
information ones. After a node has the bit estimates from both 
its children, they are combined to generate the node’s estimate 
that is passed to its parent 

or-i when/<w./2; 

P"['] = )or- ,, / 1 u • 

\pr[i-Fp], Otherwise, 

where © is modulo-2 addition (XOR). 

D. Simplified Successive-Cancellation Decoding 

Instead of traversing a sub-tree whose leaves all correspond 
to frozen or information bits, simplihed successive-cancella¬ 
tion (SSC) applies a decision rule immediately [13]. For fro¬ 
zen sub-trees, the output is set to the zero vector; while for 
information sub-tree the maximum-likelihood (ML) output is 
obtained by performing element-wise threshold detection on 
the soft-information input vector, a,,. This shrinks the decoder, 
reducing the number of calculations and increasing decoding 
speed. The SC and SSC pruned tree corresponding to an (8,5) 
polar code are shown in Fig. 3a and Fig. 3b, respectively. 

E. The Fast-SSC Decoding Algorithm 

The Fast-SSC decoding algorithm further prunes the deco¬ 
der tree by applying low-complexity decoding rules when 
encountering certain types of constituent codes. These special 
cases are; 
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(c) Fast-SSC 


Fig. 3: Decoder trees corresponding to the SC, SSC and Fast- 
SSC decoding algorithms. 


Repetition codes: are constituent codes where only the last 
bit is an information bit. These codes are efficiently decoded 
by calculating the sum of the input LLRs and using threshold 
detection to determine the result that is then replicated to form 
the estimated bits : 


Mi] 


0, when 
1, otherwise. 


where Nv is the number of leaf nodes. 

Single-parity-check (SPC) codes: are constituent codes 
where only the hrst bit is frozen. The corresponding node 
is indicated by the cross-hatched orange pattern in Fig. 3c. 
The hrst step in decoding these codes is to calculate the hard 
decision of each LLR 


jo, when > 0; 
11, otherwise. 


and then calculating the parity of these decisions 


(4) 


AT,,-! 

parity = ^ A’[*]- 

i '=0 

If the parity constraint is unsatished, the estimate of the bit 
with the smallest LLR magnitude is hipped; 


/3,.[/] = (3,,[i] © parity, where i - arg min(|Q;v,[/]|). 

i 

Repetition-SPC codes: are codes whose left constituent 
code is a repetition code and the right an SPC one. They 
can be speculatively decoded in hardware by simultaneously 
decoding the repetition code and two instances of the SPC 
code: one assuming the output of the repetition code is all 
O’s and the other all I’s. The correct result is selected once 
the output of the repetition code is available. This speculative 
decoding also provides speed gains in software. 


conhgurable at run time and the second one—the unrolled 
decoder—^presents a fully unrolled, branchless decoder fully 
exploiting SIMD vectorization. In the second version of the 
decoder, compile-time optimization plays a signihcant role in 
the performance improvements. Performance is evaluated for 
both the instruction-based and unrolled decoders. 

It should be noted that, contrary to what is common in 
hardware implementations e.g. [14], [15], natural indexing 
is used for all software decoder implementations. While bit- 
reversed indexing is well-suited for hardware decoders, SIMD 
instructions operate on independent vectors, not adjacent val¬ 
ues within a vector. Using bit-reverse indexing would have 
mandated data shuffling operations before any vectorized op¬ 
eration is performed. 

Both versions, instruction-based decoders and unrolled de¬ 
coders, use the following functions from the Fast-SSC algo¬ 
rithm [14]: F, G, G_0R, Combine, Combine_0R, Repetition, 
OSPC, RSPC, RepSPC and P_0L An Info function implement¬ 
ing eq. (4) is also added. 

Methodology for the Experimental Results: We discuss 
throughput in information bits per second as well as latency. 
Our software was compiled using the C++ compiler from 
GCC 4.9 using the flags “-march=native -funroll-loops 
-Ofast”. Additionally, auto-vectorization is always kept en¬ 
abled. The decoders are inserted in a digital communication 
chain to measure their speed and to ensure that optimizations, 
including those introduced by 

-Ofast, do not affect error-correction performance. In the 
simulations, we use binary phase shift keying (BPSK) over 
an AWGN channel with random codewords. 

The throughput is calculated using the time required to 
decode a frame averaged over 10 runs of 50,000 and 10,000 
frames each for the N - 2048 and the N > 2048 codes, 
respectively. The time required to decode a frame, or latency, 
also includes the time required to copy a frame to decoder 
memory and copy back the estimated codeword. Time is 
measured using the high precision clock provided by the Boost 
Chrono library. 

In this work we focus on decoders running on one processor 
core only since the targeted application is SDR. Typically, an 
SDR system cannot afford to dedicate more than a single core 
to error-correction as it has to perform other functions simul¬ 
taneously. For example, in SDR implementations of long term 
evolution (LTE) receivers, the orthogonal frequency-division 
multiplexing (OFDM) demodulation alone is approximately 
an order of magnitude more computationally demanding than 
the error-correction decoder [1], [2], [16]. 


Fig. 3c shows the tree corresponding to a Fast-SSC decoder 
is will be described more thoroughly in Section III-B. Other 
types of operations are introduced in the Fast-SSC algorithm, 
we refer the reader to [14] for more details. 

III. Implementation on x86 Processors 

In this section we present two different versions of the 
decoder in terms of increasing design specialization for soft¬ 
ware; whereas the first version—the instruction-based deco¬ 
der—takes advantage of the processor architecture it remains 


A. Instruction-based Decoder 

The Fast-SSC decoder implemented on a field-programma¬ 
ble gate array (FPGA) in [14] closely resembles a CPU with 
wide SIMD vector units and wide data buses. Therefore, it 
was natural to use the same design for a software decoder, 
leveraging SIMD instructions. This section describes how the 
algorithm was adapted for a software implementation. As 
fixed-point arithmetic can be used, the effect of quantization 
is shown. 
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Fig. 4; Effect of quantization on error-correction performance. 


1) Using Fixed-Point Numbers: On processors, fixed-point 
numbers are represented with at least 8 bits. As illustrated 
in Fig. 4, using 8 bits of quantization for LLRs results in 
a negligible degradation of error-correction performance over 
a floating-point representation. At a frame-error rate (PER) 
of 10“^ the performance loss compared to a floating-point 
implementation is less than 

0.025 dB for the (32768, 27568) polar code. With custom 
hardware, it was shown in [14] that 6 bits are sufhcient for 
that polar code. It should be noted that in Fast-SSC decoding, 
only the G function adds to the amplitude of LLRs and it is 
carried out with saturating adders. 

With instructions that can work on registers of packed 8- 
bit integers, the SIMD extensions available on most general- 
purpose x86 and ARM processors are a good ht to implement 
a polar decoder. 

2 ) Vectorizing the Decoding of Constituent Codes: On x86- 
64 processors, the vector instructions added with SSE support 
logic and arithmetic operations on vectors containing either 4 
single-precision floating-point numbers or 16 8-bit integers. 
Additionally, x86-64 processors with AVX instructions can 
operate on data sets of twice that size. Below are the operations 
benehting the most from explicit vectorization. 

F: the / operation (1) is often executed on large vectors 
of LLRs to prepare values for other processing nodes. The 
min() operation and the sign calculation and assignment are 
all vectorized. 

G and G_QR: the g operation is also frequently executed on 
large vectors. Both possibilities, the sum and the difference, 
of (2) are calculated and are blended together with a mask to 
build the result. The G_0R operation replaces the G operation 
when the left hand side of the tree is the all-zero vector. 

Combine and Combine_0R: the Combine operation com¬ 
bines two estimated bit-vectors using an XOR operation in a 
vectorized manner. The Combine_0R operation is to Combine 
what G_0R is to G. 

SPC decoding-, locating the LLR with the minimum mag¬ 
nitude is accelerated using SIMD instructions. 


3) Data Representation: For the decoders using floating¬ 
point numbers, the representation of j3 is changed to accelerate 
the execution of the g operation on large vectors. Thus, when 
floating-point LLRs are used, j3/[i] e {-Hi, -1) instead of (0,1). 
As a result, (2) can be rewritten as 

g(ai,[/], «„[( -H ^v/2], ) 3 ,[/]) = «,[/] * Pili] + «,,[/ -H Nf2]. 

This removes the conditional assignment and turns g() into 
a multiply-accumulate operation, which can be performed 
efficiently in a vectorized manner on modem CPUs. For 
integer LLRs, multiplications cannot be carried out on 8-bit 
integers. Thus, both possibilities of (2) are calculated and 
are blended together with a mask to build the result. The 
Combine operation is modihed accordingly for the floating¬ 
point decoder and is computed using a multiplication with 
j3,[/] 6{+l,-l). 

4) Architecture-specific Optimizations: The decoders take 
advantage of the SSSE 3, SSE 4.1 and AVX instructions when 
available. Notably, the sign and abs instractions from SSSE 3 
and the blendv instruction from SSE 4.1 are used. AVX, with 
instructions operating on vectors of 256 bits instead of the 128 
bits, is only used for the floating-point implementation since 
it does not support integer operations. Data was aligned to the 
128 (SSE) or 256-bit (AVX) boundaries for faster accesses. 

5) Implementation Comparison: Here we compare the per¬ 
formance of three implementations. First, a non-explicitly 
vectorized version using floating-point numbers. Second an 
explicitly vectorized version using floating-point numbers. 
Third, the explicitly vectorized version using a hxed-point 
number representation. In Table I, they are denoted as Float, 
SIMD-Float and SIMD-int8 respectively. 

Results for decoders using the floating-point number repre¬ 
sentation are included as the efhcient implementation makes 
the resulting throughput high enough for some applications. 
The decoders ran on a single core of an Intel Core i7-4770S 
clocked at 3.1 GHz with Turbo disabled. 

Comparing the throughput and latency of the Float and 
SIMD-Float implementations in Table I conhrms the benehts 
of explicit vectorization in this decoder. The performance of 
the SIMD-Float implementation is only 21% to 38% slower 
than the SIMD-int8 implementation. This is not a surprising 
result considering that the SIMD-Float implementation uses 
the AVX instructions operating on vectors of 256 bits while 
the SIMD-int8 version is limited to vectors of 128 bits. Table I 
also shows that vectorized implementations have 3.6 to 5.8 
times lower latency than the floating-point decoder. 

B. Unrolled Decoder 

The goal of this design is to increase vectorization and 
inlining and reduce branches in the resulting decoder by 
maximizing the information specihed at compile-time. It also 
gets rid of the indirections that were required to get good 
performance out of the instruction-based decoder. 

1) Generating an Unrolled Decoder: The polar codes de¬ 
coded by the instruction-based decoders presented in Sec¬ 
tion III-A can be specihed at run-time. This Hexibility comes at 
the cost of increased branches in the code due to conditionals. 
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TABLE I: Decoding polar codes with the instruction-based 
decoder. 


Code 

{N,k) 

Implementation 

Info T/P 
(Mbps) 

Latency 

(ps) 

(2048,1024) 

Float 

20.8 

49 


SIMD-Float 

75.6 

14 


SIMD-int8 

121.7 

8 

(2048,1707) 

Float 

41.5 

41 


SIMD-Float 

173.9 

10 


SIMD-int8 

209.9 

8 

(32768,27568) 

Float 

32.4 

825 


SIMD-Float 

124.3 

222 


SIMD-int8 

175.1 

157 

(32768,29492) 

Float 

40.8 

723 


SIMD-Float 

160.1 

184 


SIMD-int8 

198.6 

149 


Listing 1 Unrolled (8, 5) Fast-SSC Decoder 

F<8>(ao tti); 

G_0R<4>(ai, a2); 
lnfo<2>(a2, j 3 i); 

Combine_0R<4>(j3i, )32); 

G<8>(ac, <X2, ft); 

SPC<4>(a2, ft); 

Combine<8>(ft, ft, ft); 


indirections and loops. Creating a decoder dedicated to only 
one polar code enables the generation of a branchless fully- 
unrolled decoder. In other words, knowing in advance the 
dimensions of the polar code and the frozen bit locations 
removes the need for most of the control logic and eliminates 
branches there. 

A tool was built to generate a list of function calls corre¬ 
sponding to the decoder tree traversal. It was first described 
in [11] and has been significantly improved since its initial 
publication notably to add support for other node types as 
well as to add support for GPU code generation. Listing 1 
shows an example decoder that corresponds to the (8, 5) polar 
code whose dataflow graph is shown in Fig. 5. For brevity 
and clarity, in Fig. 5b, I and C_0R correspond to the Info and 
Combine_0R functions, respectively. 

2) Eliminating Superfluous Operations on fi-Values: Every 
non-leaf node in the decoder performs the combine operation 
(3), rendering it the most common operation. In (3), half the j3 
values are copied unchanged to j3,.. One method to significantly 
reduce decoding latency is to eliminate those superfluous copy 
operations by choosing an appropriate layout for j3 values in 
memory: Only N j3 values are stored in a contiguous array 
aligned to the SIMD vector size. When a combine operation 
is performed, only those values corresponding to ft will be 
updated. Since the stage sizes are all powers of two, stages 
of sizes equal to or larger than the SIMD vector size will be 
implicitly aligned so that operations on them are vectorized. 

3) Improved Layout of the a-memory: Unlike in the case 
of p values, the operations producing a values, / and g 
operations, do not copy data unchanged. Therefore, it is 
important to maximize the number of vectorized operations 
to increase decoding speed. To this end, contiguous memory 




Fig. 5: Dataflow graph of a (8,5) polar decoder. 


Listing 2 Finding the index of a given value in a vector 

std::uint32_t findldx(0!* x, a Xn,in) { 

_mm256 minVec = _mm256_broadcastb_epi8(Xn,in); 

_mm256 mask = _mm256_cmpeq_epi8(minVec, x); 

std::uint32_t mvMask = _mm256_movemask_epi8(mask); 
return _tzcnt_u32(mvMask); 

) 


is allocated for the log 2 A^ stages of the decoder. The overall 
memory and each stage is aligned to 16 or 32-byte boundaries 
when SSE or AVX instructions are used, respectively. As 
such, it becomes possible to also vectorize stages smaller 
than the SIMD vector size. The memory overhead due to 
not tightly packing the stages of a memory is negligible. As 
an example, for an A = 32,768 floating-point polar decoder 
using AVX instructions, the size of the a memory required 
by the proposed scheme is 262,208 bytes, including a 68-byte 
overhead. 

4) Compile-time Specialization: Since the sizes of the con¬ 
stituent codes are known at compile time, they are provided as 
template parameters to the functions as illustrated in Listing 1. 
Each function has two or three implementations. One is for 
stages smaller than the SIMD vector width where vectorization 
is not possible or straightforward. A second one is for stages 
that are equal or wider than the largest vectorization instruction 
set available. Finally, a third one provides SSE vectorization 
in an AVX or AVX2 decoder for stages that can be vectorized 
by the former, but are too small to be vectorized using AVX or 
AVX2. The last specialization was noted to improve decoding 
speed in spite of the switch between the two SIMD extension 
types. 

Furthermore, since the bounds of loops are compile-time 
constants, the compiler is able to unroll loops where it sees fit, 
eliminating the remaining branches in the decoder unless they 
help in increasing speed by resulting in a smaller executable. 

5) Architecture-specific Optimizations: First, the decoder 
was updated to take advantage of AVX2 instructions when 
available. These new instructions benefit the fixed-point imple¬ 
mentation as they allow simultaneous operations on 32 8-bit 
integers. 

Second, the implementation of some nodes were hand- 
optimized to better take advantage of the processor archi¬ 
tecture. For example, the SPC node was mostly rewritten. 
Listing 2 shows a small but critical subsection of the SPC 
node calculations where the index within a SIMD vector 
corresponding to the specified value is returned. The reduction 
operation required by the Repetition node has also been 
optimized manually. 
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Listing 3 Vectorized floating-point G function (g operation) 

template<unsigned int Ny> 
void G(a* a,„, a' a„„„ jS* ft„) { 
for (unsigned int i = 0; i < ^v/ 2 ; i += 8) { 

_m256 «; = _mm256_load_ps(ai„ + i); 

_m256 Ctr = _mm256_load_ps(ai„ + i + 

_m256 j3,, = _mm256_load_ps(jS„ - 1 - i); 

_m256 a'l = _mm256_xor_ps(j3,„ «;); 

_m256 ce,, = _mm256_add_ps(ar, «/); 

_mm256_store_ps(0!o„, + i, a,,); 

} 

) 


TABLE II; Decoding polar codes with floating-point precision 
using SIMD, comparing the instruction-based decoder (ID) 
with the unrolled decoder (UD). 


Code 

Info T/P (Mbps) 

Latency (^s) 

(NA) 

ID 

UD 

ID 

UD 

(2048,1024) 

75.6 

229.8 

14 

4 

(2048,1707) 

173.9 

492.2 

10 

3 

(32768,27568) 

124.3 

271.3 

222 

102 

(32768,29492) 

160.1 

315.1 

184 

94 


Listing 4 Vectorized 8-bit fixed-point G function (g operation) 

static const _m256i ONE = _mm256_setl_epi8(l); 

static const _m256i M127 = _mm256_setl_epi8(-127); 

teniplate<unsigned int Ny> 
void G(a* a,„, a* a„„„ jS* ftj { 
for (unsigned int i = 0; i < ^v/ 2 ; i += 32) { 

_m256i = _mm256_load_si256(ai„ + i); 

_m256i Or = _mm256_load_si256(a,„ -t i - 1 - ^>-/ 2 ); 

_m256i jSj- = _mm256_load_si256(ft„ - 1 - i); 

_m256i P' = _mm256_or_si256(jS,,, ONE); 

_m256i a'l = _mm256_sign_epi8(a;, P',); 

_m256i Oy = _mm256_add_ps(ar, C</); 

_m256i = _mm256_max_epi8(M127, a„); 

_mm256_store_si256(aouf + i> cc')'. 



Third, for the floating-point implementation, j3 was chan¬ 
ged to be in {-hO,- 0) instead of {-i-l,-l). In the floating¬ 
point representation [17], the most significant bit only carries 
the information about the sign. Flipping this bit effectively 
changes the sign of the number. By changing the mapping 
for p, multiplications are replaced by faster bitwise XOR 
operations. Similarly, for the 8-bit fixed-point implementation, 
P was changed to be in (0,-128) to reduce the complexity of 
the Info and G functions. 

Listings 3 and 4 show the resulting G functions for both 
the floating-point and fixed-point implementations as examples 
illustrating bottom-up optimizations used in our decoders. 

6) Memory Footprint: The memory footprint is considered 
an important constraint for software applications. Our pro¬ 
posed implementations use 2 contiguous memory blocks that 
correspond to the a and p values, respectively. The size of 
the j3-memory is 

Mp^NWp, (5) 

where N is the frame length, Wp is the number of bits used 
to store a j3 value and is in bits. 

The size of the a-memory can be expressed as 



where N is the frame length, Wa is the number of bits used to 
store an a value, A is the number of a values per SIMD vector 
and Ma is in bits. Note that the expression of Ma contains the 
expression for the overhead Maon due to tightly packing the 


a values as described in Section III-B3: 

/logj(A)-l Y 

M„oH=Alog2A- ^ 2 ' 1T„. (7) 

V /=0 /. 

The memory footprint can thus be expressed as 

Mtotal =Mp+Ma 

'log2(A)-l 

(2V-l)-HAlog 2 A- Yj 2' 

(=0 

The memory footprint in kilobytes can be approximated with 

N(Wp + 2Wa) 

Af total (kbytes) ~ 8000 

7) Implementation Comparison: We first compare the 
SIMD-float results for this implementation—the unrolled 
decoder—with those from Section III-A—the instruction- 
based decoder. Then we show SIMD-int8 results and compare 
them with that of the software decoder of Le Gal et. al [10]. 
As in the previous sections, the results are for an Intel Core 
i7-4770S running at 3.1 GHz when Turbo is disabled and at up 
to 3.9 GHz otherwise. The decoders were limited to a single 
CPU core. 

Table II shows the impact of the optimizations introduced 
in the unrolled version on the SIMD-float implementations. 
It resulted in the unrolled decoders being 2 to 3 times faster 
than the flexible, instruction-based, ones. Comparing Tables I 
and II shows an improvement factor from 3.3 to 5.7 for the 
SIMD-int8 implementations. It should be noted that some of 
the improvements introduced in the unrolled decoders could be 
backported to the instruction-based decoders, and is considered 
for future work. 

Compared to the software polar decoders of [10], Table III 
shows that our throughput is lower for short frames but can be 
comparable for long frames. However, latency is an order of 
magnitude lower for all code lengths. This is to be expected as 
the decoders of [10] do inter-frame parallelism i.e. parallelize 
the decoding of independent frames while we parallelize the 
decoding of a frame. The memory footprint of our decoder 
is shown to be approximately 24 times lower than that of 
[10]. The results in [10] were presented with Turbo frequency 
boost enabled; therefore we present two sets of results for our 
proposed decoder: one with Turbo enabled, indicated by the 
asterisk (*) and the 3.1-1- GHz frequency in the table, and one 
with Turbo disabled. The results with Turbo disabled are more 
indicative of a full SDR system as all CPU cores will be fully 
utilized, not leaving any thermal headroom to increase the 



^NWp + 
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TABLE III: Comparison of the proposed software decoder with that of [10]. 


Decoder 

Target 

L3 

Cache 

/ 

(GHz) 

Code 

(Nff) 

Mem. footprint 
(kbytes) 

Info T/P 
(Mbps) 

Latency 

(Ps) 

[10]* 

Intel Core i7-4960HQ 

6MB 

3.6+ 

(2048,1024) 

144 

1,320 

25 





(2048,1707) 

144 

2,172 

26 





(32768,27568) 

2304 

1,232 

714 





(32768,29492) 

2304 

1,557 

605 

this work 

Intel Core 17-47708 

SMB 

3.1 

(2048,1024) 

6 

398 

3 





(2048,1707) 

6 

1,041 

2 





(32768,27568) 

98 

886 

31 





(32768,29492) 

98 

1,131 

26 

this work* 

Intel Core 17-47708 

SMB 

3.1+ 

(2048,1024) 

6 

502 

2 





(2048,1707) 

6 

1,293 

1 





(32768,27568) 

98 

1,104 

25 





(32768,29492) 

98 

1,412 

21 


^Results with Turbo enabled. 


TABLE IV: Effect of unrolling and algorithm choice on 
decoding speed of the (2048, 1707) code on the Intel Core 
i7-4770S 


Decoder 

Info T/P (Mbps) 

Latency (^s) 

ID 

210 

8.1 

UD SC 

363 

4.7 

UD Fast-SSC 

1041 

1.6 


frequency. The maximum Turbo frequencies are 3.8 GHz and 
3.9 GHz for the i7-4960HQ and i7-4770S CPUs, respectively. 

Looking at the first two, or last two rows of Table II, it 
can be seen that for a fixed code length, the decoding latency 
is smaller for higher code rates. The tendency of decoding 
latency to decrease with increasing code rate and length was 
first discussed in [18]. It was noted that higher rate codes 
resulted in SSC decoder trees with fewer nodes and, therefore, 
lower latency. Increasing the code length was observed to have 
a similar, but lesser, effect. However, once the code becomes 
sufficiently long, the limited memory bandwidth and number 
of processing resources form bottlenecks that negate the speed 
gains. 

The effects of unrolling and using the East-SSC algorithm 
instead of SC are illustrated in Table IV. It can be observed that 
unrolling the East-SSC decoder results in a 5 time decrease in 
latency. Using the East-SSC instead of SC decoding algorithm 
decreased the latency of the unrolled decoder by 3 times. 

IV. Implementation on Embedded Processors 

Many of the current embedded processors used in SDR 
applications also offer SIMD extensions, e.g. NEON for ARM 
processors. All the strategies used to develop an efficient 
x86 implementation can be applied to the ARM architecture 
with changes to accommodate differences in extensions. Eor 
example, on ARM, there is no equivalent to the movemask 
SSE/AVX x86 instruction. 

The equations for the memory footprint provided in Sec¬ 
tion III-B6 also apply to our decoder implementation for 
embedded processors. 

Comparison with Similar Works: Results were obtained 
using the ODROID-U3 board, which features a Samsung 
Exynos 4412 system on chip (SoC) implementing an ARM 


TABLE V: Decoding polar codes with 8-bit fixed-point num¬ 
bers on an ARM Cortex A9 using NEON. 


Code 

(Nff) 

Decoder 

Mem. 

Footprint 

(kBvtes) 

T/P (Mbps) 

Latency 

(Ps) 

Coded 

Info 

(1024,512) 

[9] 

38 

70.5 

35.3 

232 


[9]* 

38 

80.6 

42.9 

191 


this work 

3 

113.1 

56.6 

9 

(32768,29492) 

[9] 

1,216 

33.1 

29.8 

15,844 


[9]* 

1,216 

40.2 

36.2 

13,048 


this work 

98 

90.8 

81.7 

361 


^Results linearly scaled for the clock frequency difference. 


Cortex A9 clocked at 1.7 GHz. Like in the previous sections, 
the decoders were only allowed to use one core. Table V 
shows the results for the proposed unrolled decoders and 
provides a comparison with [9]. As with their desktop CPU 
implementation of [10], inter-frame parallelism is used in the 
latter. 

It can be seen that the proposed implementations provide 
better latency and greater throughput at native frequencies. 
Since the ARM CPU in the Samsung Exynos 4412 is clocked 
at 1.7 GHz while that in the NVIDIA Tegra 3 used in [9] is 
clocked at 1.4 GHz, we also provide linearly scaled throughput 
and latency numbers for the latter work, indicated by an 
asterisk (*) in the table. Compared to the scaled results of 
[9], the proposed decoder has 1.4-2.25 times the throughput 
and its latency is 25-36 times lower. The memory footprint 
of our proposed decoder is approximately 12 times lower than 
that of [9]. Both implementations are using 8-bit fixed-point 
values. 

V. Implementation on Graphical Processing Units 

Most recent graphical processing units (GPU) have the capa¬ 
bility to do calculations that are not related to graphics. These 
GPUs are often called general purpose GPUs (GPGPU). In 
this section, we describe our approach to implement software 
polar decoders in CUDA C [19] and present results for these 
decoders running on a NVIDIA Tesla K20c. 

Most of the optimization strategies cited above could be 
applied or adapted to the GPU. However, there are noteworthy 
differences. Note that, when latency is mentioned below we 
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Fig. 6: Effect of the number of threads per block on the 
information throughput and decoding latency for a (1024,922) 
polar code where the number of blocks per kernel is 208. 



Blocks per kernel 

Fig. 7: Effect of the number of blocks per kernel on the data 
transfer and kernel execution latencies for a (2048,1707) polar 
code where the number of threads per block is 128. 


refer to the decoding latency including the delay required to 
copy the data in and out of the GPU. 

A. Overview of the GPU Architecture and Terminology 

A NVIDIA GPU has multiple microprocessors with 32 
cores each. Cores within the same microprocessor may com¬ 
municate and share a local memory. However, synchronized 
communication between cores located in different micropro¬ 
cessors often has to go through the CPU and is thus costly 
and discouraged [20]. 

GPUs expose a different parallel programming model than 
general purpose processors. Instead of SIMD, the GPU model 
is single-instruction-multiple-threads (SIMT). Each core is ca¬ 
pable of running a thread. A computational kernel performing 
a specific task is instantiated as a block. Each block is mapped 
to a microprocessor and is assigned one thread or more. 

As it will be shown in Sect. V-C, the latency induced by 
transferring data in and out of a GPU is high. To minimize 
decoding latency and maximize throughput, a combination 
of intra- and inter-frame parallelism is used for the GPU 
contrary to the CPUs where only the former was applied. We 
implemented a kernel that decodes a single frame. Thus, a 
block corresponds to a frame and attributing e.g. 10 blocks to 
a kernel translates into the decoding of 10 frames in parallel. 

B. Choosing an Appropriate Number of Threads per Block 

As stated above, a block can only be executed on one 
microprocessor but can be assigned many threads. However, 
when more than 32 threads are assigned to a block, the threads 
starting at 33 are queued for execution. Queued threads are 
executed as soon as a core is free. 

Fig. 6 shows that increasing the number of threads assigned 
to a block is beneficial only until a certain point is reached. For 
the particular case of a (1024,922) code, associating more than 
128 threads to a block negatively affects performance. This is 
not surprising as the average node width for that code is low 
at 52. 


C. Choosing an Appropriate Number of Blocks per Kernel 

Memory transfers from the host to the GPU device are of 
high throughput but initiating them induces a great latency. 
The same is also true for transfers in the other direction, from 
the device to the host. Thus, the number of distinct transfers 
have to be minimized. The easiest way to do so is to run a 
kernel on multiple blocks. For our application, it translates to 
decoding multiple frames in parallel as a kernel decodes one 
frame. 

Yet, there s a limit to the number of resources that can be 
used to execute a kernel i.e. decode a frame. At some point, 
there will not be enough computing resources to do the work 
in one pass and many passes will be required. The NVIDIA 
Tesla K20c card features the Kepler GKllO GPU that has 13 
microprocessors with 32 cores and 16 load and store units 
each [21]. In total, 416 arithmetic or logic operations and 208 
load or store operations can occur simultaneously. 

Yet, there is a limit to the number of resources that can be 
used to execute a kernel i.e. decode a frame. At some point, 
there will not be enough computing resources to do the work 
in one pass and many passes will be required. The NVIDIA 
Tesla K20c card features the Kepler GKllO GPU that has 13 
microprocessors with 32 cores and 16 load and store units 
each [21]. In total, 416 arithmetic or logic operations and 208 
load or store operations can occur simultaneously. 

Fig. 7 shows the latency to execute a kernel, to transfer 
memory from the host to the GPU and vice versa for a 
given number of blocks per kernel. The number of threads 
assigned per block is fixed to 128 and the decoder is built for 
a (2048,1707) polar code. It can be seen that the latency of 
memory transfers grows linearly with the number of blocks 
per kernel. The kernel latency however has local minimums 
at multiples of 208. We conclude that the minimal decoding 
latency, the sum of all three latencies illustrated in Fig. 7, is 
bounded by the number of load and store units. 

D. On the Constituent Codes Implemented 

Not all the constituent codes supported by the general 
purpose processors are beneficial to a GPU implementation. 
In a SIMT model, reduction operations are costly. Moreover, 
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Blocks per kernel 

Fig. 8: Information throughput comparison for a (1024,922) 
polar code where intermediate results are stored in shared or 
global memory. The number of threads per block is 128. 


if a conditional execution leads to unbalanced threads, perfor¬ 
mance suffers. Consequently, all nodes based on the single- 
parity-check (SPC) codes, that features both characteristics, 
are not used in the GPU implementation. 

Experiments have shown that implementing the SPC node 
results in a throughput reduction by a factor of 2 or more. 

E. Shared Memory and Memory Coalescing 

Each microprocessor contains shared memory that can be 
used by all threads in the same block. The NVIDIA Tesla K20c 
has 48 kfi of shared memory per block. Individual reads and 
writes to the shared memory are much faster than accessing 
the global memory. Thus, intuitively, when conducting the 
calculations within a kernel, it seems preferable to use the 
shared memory as much as possible in place of the global 
memory. 

However, as shown by Eig. 8, it is not always the case. 
When the number of blocks per kernel is small, using the 
shared memory provides a significant speedup. In fact, with 
64 blocks per kernel, using shared memory results in a decoder 
that has more than twice the throughput compared to a kernel 
that only uses the global memory. Past a certain value of blocks 
per kernel though, solely using the global memory is clearly 
advantageous for our application. 

These results suggest that the GPU is able to efficiently 
schedule memory transfers when the number of blocks per 
kernel is sufficiently high. 


TABLE VI: Decoding polar codes on an NVIDIA Tesla K20c. 


Code 

{N^k) 

Nbr of 
Blocks 

Info T/P 
(Mbps) 

Latency 

(ms) 

(1024,922) 

208 

1,022 

0.6 


416 

1,046 

1.1 


624 

1,060 

1.6 


832 

1,070 

2.2 

(2048,1707) 

208 

915 

1.1 


416 

936 

2.2 


624 

953 

3.3 


832 

964 

4.5 

(4096,3686) 

208 

959 

2.6 


416 

1,002 

4.9 


624 

1,026 

6.9 


832 

1,043 

9.4 


the GPU—memory transfers and execution of the kernel can 
be overlapped, effectively multiplying throughput by a factor 
of 3. 

This also increases the memory footprint by a factor of 
three. On the GPU, the memory footprint is 


44total (kbytes) ~ 


+ Wa)BS 
8000 


( 10 ) 


where B is the number of blocks per kernel—i.e. the number 
of frames being decoded simultaneously—, S is the number of 
streams, and where Wp and Wa are the number of bits required 
to store a j3 and an a value, respectively. Eor best performance, 
as detailed in the next section, both j3 and a values are 
represented with floating-point values and thus Wp - Wa - 32. 


G. On the Use of Fixed-Point Numbers on a GPU 

It is tempting to move calculations to 8-bit fixed-point 
numbers in order to speedup performance, just like we did 
with the other processors. However, GPUs are not optimized 
for calculations with integers. Current GPUs only support 32- 
bit integers. Even so, the maximum number of operations per 
clock cycle per multiprocessor as documented by NVIDIA 
[19] clearly shows that integers are third class citizens behind 
single- and double-precision floating-point numbers. As an 
example. Table 2 of [19] shows that GPUs with compute 
capability 3.5—like the Tesla K20c—can execute twice as 
many double-precision floating-point multiplications in a given 
time than it can with 32-bit integers. The same GPU can carry 
on 6 times more floating-point precision multiplications than 
its 32-bit integer counterpart. 


F. Asynchronous Memory Transfers and Multiple Streams 

Transferring memory from the host to the device and vice 
versa induces a latency that can be equal to the execution 
of a kernel. Fortunately, that latency can be first reduced by 
allocating pinned or page-locked host memory. As page-locked 
memory can be mapped into the address space of the device, 
the need for a staging memory is eliminated [19]. 

More significantly, NVIDIA GPUs with compute capability 
of 2.0 or above are able to transfer memory in and out 
of the device asynchronously. By creating three streams— 
sequences of operations that get executed in issue-order on 


//. Results 

Table VI shows the estimated information throughput and 
measured latency obtained by decoding various polar codes 
on a GPU. The throughput is estimated by assuming that 
the total memory transfer latencies are twice the latency 
of the decoding. This has been verified to be a reasonable 
assumption, using NVIDIA’s profiler tool, when the number 
of blocks maximizes throughput. 

Performing linear regression on the results of Table VI 
indicates that the latency scales linearly with the number of 
blocks, leading to standard error values of 0.04, 0.04 and 
0.14 for the (1024,922), (2048,1707) and (4096,3686) polar 
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TABLE VII; Comparison of the power consumption and energy per information bit for the (2048,1707) polar code. 


Decoder 

Target 

Mem. Footprint 
(kbytes) 

Info. T/P 
(Gbps) 

Latency 

(Ps) 

Power 

(W) 

Energy 
(nj/info. bit) 

[10] 

Intel Core i7-4960HQ* 

144 

2.2 

26 

13 

6 

this work 

Intel Core i7-4770S 

6 

1.0 

2 

3 

3 


Intel Core i7-4770S* 

6 

1.3 

1 

5 

4 


ARM Cortex A9 

6 

0.1 

14 

0.8 

7 


NVIDIA Tesla K20c 

3,408+ 

0.9 

1100 

108 

118 


*Results with Turbo enabled. 

^Amount required per stream. Three .'Streams are required to sustain this throughput. 


codes, respectively. In our decoder, a block corresponds to the 
decoding a single frame. The frames are independent of each 
other, and so are blocks. Thus, our decoder scales well with 
the number of available cores. 

Furthermore, looking at Table VI it can be seen that the 
information throughput is in the vicinity of a gigabit per 
second. Experiments have shown that the execution of two 
kernels can slightly overlap, making our throughput results 
of Table VI worst-case estimations. For example, while the 
information throughput to decode 832 frames of a (4096, 3686) 
polar code is estimated at 1,043 Mbps in Table VI, the mea¬ 
sured average value in NVIDIA’s profiler tool was 1,228 Mbps, 
a 18% improvement over the estimated throughput. 

Our experiments have also shown that our decoders are 
bound by the data transfer speed that this test system is capable 
of. The PCIe 2.0 standard [22] specifies a peak data throughput 
of 64 Gbps when 16 lanes are used and once 8b 10b encoding is 
accounted for. Decoding 832 frames of a polar code of length 
N - 4096 requires the transfer of 3,407,872 LLRs expressed 
as 32-bit floating-point numbers for a total of approximately 
109 Mbits. Without doing any computation on the GPU, our 
benchmarks measured an average PCIe throughput of 45 Gbps 
to transfer blocks of data of that size from the host to the 
device and back. Running multiple streams and performing 
calculations on the GPU caused the PCIe throughput to drop 
to 40 Gbps. This corresponds to 1.25 Gbps when 32-bit floats 
are used to represent LLR inputs and estimated-bit outputs of 
the decoder. In light of these results, we conjecture that the 
coded throughput will remain approximately the same for any 
polar code as the PCIe link is saturated and data transfer is 
the bottleneck. 

VI. Energy Consumption Comparison 

In this section the energy consumption is compared for all 
three processor types: the desktop processor, the embedded 
processor and the GPU. Unfortunately the Samsung Exynos 
4412 SoC does not feature sensors allowing for power usage 
measurements of the ARM processor cores. The energy con¬ 
sumption of the ARM processor was estimated from board- 
level measurements. An Agilent E3631A DC power supply 
was used to provide the 5V input to the ODROID-U3 board 
and the current as reported by the power supply was used to 
calculated the power usage when the processor was idle and 
under load. 

On recent Intel processors, power usage can be calculated by 
accessing the Running Average Power Limit (RAPE) counters. 


The LIKWID tool suite [23] is used to measure the power 
usage of the processor. Numbers are for the whole processor 
including the DRAM package. Recent NVIDIA GPUs also 
feature on-chip sensors enabling power usage measurement. 
Steady state values are read in real-time using the NVIDIA 
Management Libray (NVML) [24]. 

Table VII compares the energy per information bit required 
to decode the (2048,1707) polar code. The SIMD-int8 im¬ 
plementation of our unrolled decoder is compared with that 
of the implementation in [10]. The former uses an Intel Core 
i7-4770S clocked at 3.1 GHz. The latter uses an Intel Core i7- 
4960HQ clocked at 3.6 GHz with Turbo enabled. The results 
for the ARM Cortex A9 embedded processor and NVIDIA 
Tesla K20c GPU are also included for comparison. Note that 
the GPU represents LLRs with floating-point numbers. 

The energy per information bit is calculated with 

® W""°- = info. ■ 

It can be seen that the proposed decoder is slightly more en¬ 
ergy efficient on a desktop processor compared to that of [10]. 
For that polar code, the latter offers twice the throughput but at 
the cost of a latency that is at least 13 times greater. However, 
the latter is twice as fast for that polar code. Decoding on 
the embedded processor offers very similar energy efficiency 
compared to the Intel processor although the data throughput 
is an order of magnitude slower. However, decoding on a GPU 
is significantly less energy efficient than any of the decoders 
running on a desktop processor. 

The power consumption on the embedded platform was 
measured to be fairly stable with only a 0.1 W difference 
between the decoding of polar codes of lengths 1024 or 
32,768. 

VII. Further Discussion 

A. On the relevance of the instruction-based decoders 

Some applications require excellent error-correction perfor¬ 
mance that necessitates the use of polar codes much longer 
than N - 32,768. For example. Quantum Key Distribution 
benefits from frames of 2^^ to bits [25]. At such lengths, 
current compilers fail to compile an unrolled decoder. How¬ 
ever, the instruction-based decoders are very suitable and are 
capable of throughput greater than 100 Mbps with a code of 
length 1 million. 
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Polar: -R=l/2 R=2/3 -+>- R=3/4 R=5/6 

LDPC: - R=l/2 R=2/3 R=3/4 R=5/6 


Fig. 9: Error-correction performance of the polar codes of 
length 2048 compared with the LDPC codes of length 1944 
from the 802.1 In standard. 

B. On the relevance of software decoders in comparison to 
hardware decoders 

The software decoders we have presented are good for 
systems that require moderate throughput without incurring 
the cost of dedicated hardware solutions. For example, in a 
software-defined radio communication chain based on USRP 
radios and the GNU Radio software framework, a forward 
error-correction (EEC) solution using our proposed decoders 
only consumes 5% of the total execution time on the receiver. 
Thus, freeing FPGA resources to implement functions other 
than EEC, e.g. synchronization and demodulation. 

C. Comparison with LDPC codes 

LDPC codes are in widespread use in wireless communica¬ 
tion systems. In this section, the error-correction performance 
of moderate-length polar codes is compared against that of 
standard LDPC codes [26]. Similarly, the performance of the 
state-of-the-art software LDPC decoders is compared against 
that of our proposed unrolled decoders for polar codes. 

The fastest software LDPC decoders in literature are those 
of [6], which implements decoders for the 802.1 In standard 
and present results for the Intel Core 17-2600 x86 processor. 
That wireless communication standard defines three code 
lengths; 1944, 1296, 648; and four code rates; 1/2, 2/3, 3/4, 
5/6. In [6], LDPC decoders are implemented for all four codes 
rates with a code length of 1944. A layered offset-min-sum 
decoding algorithm with five iterations is used and early- 
termination is not supported. 

Fig. 9 shows the frame-error rate (FER) of these codes using 
10 iterations of a flooding-schedule offset min-sum floating¬ 
point decoding algorithm which yields slightly better results 
than the five iteration layered algorithm used in [6]. The FER 
of polar codes with a slightly longer length of 2048 and 
matching code rates are also shown in Fig. 9. 

Table VIII that provides the latency and information 
throughput for decoding 524,280 information bits using the 
state-of-the-art software LDPC decoders of [6] compared to 


TABLE VIII; Information throughput and latency of the polar 
decoders compared with the LDPC decoders of [6] when 
estimating 524,280 information bits on a Intel Core 17-2600. 


Decoder 

N 

Rate ' 


Latency 

. Info. T/P 
(Mbps) 

total (ms) 

per frame (jis) 

[6] 

1944 

1/2 

17.4 

N/A 

30.1 



2/3 

12.7 

N/A 

41.0 



3/4 

11.2 

N/A 

46.6 



5/6 

9.3 

N/A 

56.4 

this work 

2048 

1/2 

2.0 

3.83 

267.4 



2/3 

1.0 

2.69 

507.4 



3/4 

0.8 

2.48 

619.4 



5/6 

0.6 

2.03 

840.9 


our proposed polar decoders. To remain consistent with the 
result presented in [6], which used the Intel Core 17-2600 
processor, the results in Table VIII use that processor as well. 

While the polar code with rate 1/2 offers a better coding gain 
than its LDPC counterpart, all other polar codes in Fig. 9 are 
shown to suffer a coding loss close to 0.25 dB at a FER of 
10“^. However, as Table VIII shows, there is approximately an 
order of magnitude advantage for the proposed unrolled polar 
decoders in terms of both latency and throughput compared 
to the LDPC decoders of [6]. 


VIII. Conclusion 

In this work, we presented low-latency software polar de¬ 
coders adapted to different processor architectures. The decod¬ 
ing algorithm is adapted to exploit different SIMD instruction 
sets for the desktop and embedded processors (SSL, AVX and 
NEON) or to the SIMT model inherent to GPU processors. 
The optimization strategies go beyond parallelisation with 
SIMD or SIMT. Most notably, we proposed to generate a 
branchless fully unrolled decoder, to use compile-time spe¬ 
cialization, and adopt a bottom-up approach by adapting the 
decoding algorithm and data representation to features offered 
by processor architectures. For desktop processors, we have 
shown that intra-frame parallelism can be exploited to get 
a very low-latency while achieving information throughputs 
greater than 1 Gbps using a single core. For embedded pro¬ 
cessors, the principle remains but the achievable information 
throughputs are more modest at 80 Mbps. On the GPU we 
showed that inter-frame parallelism could be successfully used 
in addition to intra-frame parallelism to reach better speed, 
and the impact of two critical parameters on the performance 
of the decoders was explored. We showed that given the 
right set of parameters, GPU decoders are able to sustain an 
information throughput around 1 Gbps while simultaneously 
decoding hundreds of frames. Finally, we showed that the 
memory footprint of our proposed decoder is at least an order 
of magnitude lower than that our the state-of-the-art polar 
decoder while being slightly more energy efficient. These 
results indicate that the proposed software decoders make 
polar codes interesting candidates for software-defined radio 
applications. 
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